Abstract 

Being able to see a talking face confers a considerable advantage for speech perception in adulthood. However, behavioural data 
currently suggest that children fail to make full use of these available visual speech cues until age 8 or 9. This is particularly 
surprising given the potential utility of multiple informational cues during language learning. We therefore explored this at the 
neural level. The event-related potential (ERP) technique has been used to assess the mechanisms of audio-visual speech perception 
in adults, with visual cues reliably modulating auditory ERP responses to speech. Previous work has shown congruence-dependent 
shortening of auditory N1/P2 latency and congruence-independent attenuation of amplitude in the presence of auditory and visual 
speech signals, compared to auditory alone. The aim of this study was to chart the development of these well-established modulatory 
effects over mid-to-late childhood. Experiment 1 employed an adult sample to validate a child-friendly stimulus set and paradigm by 
replicating previously observed effects of N1/P2 amplitude and latency modulation by visual speech cues; it also revealed greater 
attenuation of component amplitude given incongruent audio-visual stimuli, pointing to a new interpretation of the amplitude 
modulation effect. Experiment 2 used the same paradigm to map cross-sectional developmental change in these ERP responses 
between 6 and 11 years of age. The effect of amplitude modulation by visual cues emerged over development, while the effect of 
latency modulation was stable over the child sample. These data suggest that auditory ERP modulation by visual speech represents 
separable underlying cognitive processes, some of which show earlier maturation than others over the course of development. 



Research highlights 

• The electrophysiological correlates of audio-visual 
speech perception show a course of gradual matura- 
tion over mid-to-late childhood. 

• Electrophysiological data reveal that the speed of 
processing auditory speech is modulated by visual 
cues earlier in development than is suggested by 
behavioural data with children. 

• In adults, the attenuation of auditory ERP compo- 
nent amplitude by visual speech cues is interpreted as 
an effect of cross-modal competition. 

• It is suggested that the shortening of auditory ERP 
component latency by visual cues in adults may 
represent the prediction of both content and timing 
of the up-coming auditory speech signal. 



Speech is multisensory 

During face-to-face interaction the perception of speech 
is a multisensory process, with visual cues available from 
the talking face according a substantial benefit to adult 
listeners. Audio-visual speech perception has been fairly 
extensively studied in the adult population, yet little is 
understood about the extent to which, or how, children 
make use of these powerful cues when learning language. 
The aim of this study was to illuminate this matter 
through event-related potential (ERP) recordings with a 
developmental sample to establish how visual input 
modulates auditory processing over mid-to-late child- 
hood. 

Visual speech cues, that is movements of the lips, jaw, 
tongue and larynx, correlate closely with auditory output 
(Chandrasekaran, Trubanova, Stillittano, Caplier & 
Ghazanfar, 2009). Such cues are of particular benefit 
to adult listeners under conditions of auditory noise, 
when their availability can result in improvements in 
response accuracy equivalent to as much as a 15 dB 
increase in the auditory signal-to-noise ratio (Grant & 
Greenberg, 2001; Grant & Seitz, 2000; Sumby & Pollack, 
1954). Visual cues can also create some powerful 
illusions, including the McGurk illusion, where incon- 
gruent auditory and visual inputs result in an overall 
percept derived from but different to the input from each 
sensory modality (McGurk & MacDonald, 1976). For 
example a visual /ga/ dubbed over an auditory /ba/ often 
results in the percept /da/. Other illusions similarly 
involve visual cues altering the perceived content (Green, 
Kuhl, Meltzoff & Stevens, 1991) or location (Alais & 
Burr, 2004) of the auditory signal. 

The development of audio-visual speech 
perception 

Work with infants indicates a very early sensitivity to 
multisensory speech cues. By two months of age infants 
can match auditory and visual vowels behaviourally 
(Kuhl & Meltzoff, 1982; Patterson & Werker, 1999). 
Bristow and colleagues (Bristow, Dehaene-Lambertz, 
Mattout, Soares, Gliga, Baillet & Mangin, 2008) used an 
electrophysiological mismatch negativity paradigm to 
show that visual speech cues habituated 10-week-old 
infants to auditory tokens of the same phoneme, but not 
auditory tokens of a different phoneme. Such evidence 
suggests that infants have a multisensory representation 
of the phonemes tested, or at least are able to match 
across senses in the speech domain. By 5 months of age, 
infants are sensitive to the McGurk illusion, as 
shown both behaviourally (Burnham & Dodd, 2004; 
Rosenblum, Schmuckler & Johnson, 1997; Patterson & 
Werker, 1999), and electrophysiologically (Kushnerenko, 
Teinonen, Volein & Csibra, 2008). Notably though, 
audio-visual speech perception may not be robust or 
consistent at this age due to a relative lack of experience 
(Desjardins & Werker, 2004). Nevertheless, infants pay 
attention to the mouths of speakers at critical times for 
language development over the first year (Lewkowicz & 
Hansen-Tift, 2012), during which time they may even use 
visual cues to help develop phonemic categories 
(Teinonen, Aslin, Alku & Csibra, 2008). 

By contrast, children do not seem to show sensitivity 
to, or benefit from, visual cues to the extent that the 
infant data might predict (e.g. Massaro, Thompson, 
Barron & Laren, 1986). Typically, children have been 
shown to be insensitive to the McGurk illusion at age 5, 
then to show a gradual or stepped developmental 
progression to the end of primary school or into the 
teenage years (Hockley & Polka, 1994; McGurk & 
MacDonald, 1976). Reliable responses to this illusion 
emerge at around 8 or 9 years (Tremblay, Champoux, 
Voss, Bacon, Lepore & Theoret, 2007), the same age at 
which children robustly use visual cues to help overcome 
noise in the auditory signal (Wightman, Kistler & 
Brungart, 2006). Ross and colleagues (Ross, Molholm, 
Blanco, Gomez-Ramirez, Saint-Amour & Foxe, 2011) 
demonstrated not only the increasing benefit of visual 
cues over the ages of 5 to 14, but also a change in the 
profile of how useful visual speech cues were under 
conditions of different auditory signal-to-noise ratios. Of 
particular interest in a discussion of developmental 
trajectories is the finding from an indirect measure of 
audio-visual speech perception that, while 5-year-olds do 
not show sensitivity to visual cues, 4-year-olds do (Jerger, 
Damian, Spence, Tye-Murray & Abdi, 2009); hinting at 
a U-shaped developmental trajectory in audio-visual 
speech development. 

This developmental pattern of very early sensitivity 
but late mastery is mirrored in other domains of 
multisensory development. For example, at 4 months 
old infants are subject to low-level audio-visual illusions 
(Kawabe, Shirai, Wada, Miura, Kanazawa & Yamaguchi, 
2010; Wada, Shirai, Midorikawa, Kanazawa, Dan & 
Yamaguchi, 2009). However, accuracy in the use of 
information from multiple senses continues to improve 
through childhood, and mastering the ability to appro- 
priately weight information from different senses accord- 
ing to their reliability only emerges from around age 8 
(Gori, Del Viva, Sandini & Burr, 2008). 

Electrophysiological recordings of multisensory 
speech 

The aim of the current work was to understand the 
development of audio-visual speech perception at the 
neurophysiological level. Event-related potential (ERP) 
recordings have repeatedly been used to explore the 
mechanisms of multisensory processing with adult sam- 
ples, largely due to the excellent temporal resolution of this 
technique (Besle, Bertrand & Giard, 2009; Teder-Sälejärvi, 
McDonald, DiRusso & Hillyard, 2002). In this case, 
we were interested in how visual cues influence, or 
modulate, auditory processing of speech stimuli. The 
auditory N1 and P2 ERP components, often referred to 
together as the vertex potential, are highly responsive to 
auditory speech (e.g. Hoonhorst, Serniclaes, Collet, 
Colin, Markessis, Radeau & Deltenre, 2009; Pang & 
Taylor, 2000). The characteristics of these early-to-mid 
latency auditory components, when evoked in response to 
speech stimuli, are modulated by the presence of visual 
speech cues in adults (Bernstein, Auer, Wagner & Ponton, 
2007; Besle, Fischer, Bidet-Caulet, Lecaignard, Bertrand 
& Giard, 2008; Besle, Fort, Delpuech & Giard, 2004; 
Klucharev, Mottonen & Sams, 2003; Pilling, 2009; 
Stekelenburg & Vroomen, 2007; van Wassenhove, Grant 
& Poeppel, 2005). Visual cues are shown to both attenuate 
the amplitude of N1 and P2 and, given congruence 
between auditory and visual inputs, shorten their latency 
(Pilling, 2009; van Wassenhove et al., 2005). While auditory 
N1 and P2 are most robustly modulated by visual 
speech, even earlier electrophysiological activity is 
affected. The auditory P50 is attenuated during intracranial 
(Besle et al., 2008) and sub-dural (Reale, Calvert, 
Thesen, Jenison, Kawasaki, Oya, Howard & Brugge, 2007) 
recordings over the lateral superior temporal gyrus; and 
even auditory brainstem responses and middle latency 
auditory evoked potentials attenuate in amplitude and 
reduce in latency when participants are able to see a talking 
face (Musacchia, Sams, Nicol & Kraus, 2006). 

Given multiple replications of the modulation of 
auditory N1 and P2 by visual speech cues in adults 
(Bernstein et al., 2007; Besle et al., 2004, 2008; 
Klucharev et al., 2003; Pilling, 2009; Stekelenburg & 
Vroomen, 2007; Van Wassenhove et al., 2005), and the 
correlation of these effects with the perception of 
multisensory illusions (Van Wassenhove et al., 2005), 
this can reasonably be taken to represent at least the 
influence of visual cues on auditory processing, even if 
not necessarily the integration of information at the 
single-neuron level. Here we traced these markers of 
audio-visual speech perception through development. 
Finding either the modulation of amplitude or latency 
of the N1/P2 complex over development could help 
establish the limitations on children's use of multisen- 
sory speech cues. Experiment 1 therefore used an adult 
sample to validate a novel child-friendly paradigm and 
stimulus set by replicating previous findings of congru- 
ence-dependent latency modulation and congruence- 
independent amplitude modulation of auditory N1 and 
P2 by visual cues (Van Wassenhove et al., 2005). Four 
experimental conditions allowed the assessment of the 
impact of visual speech cues on auditory processing: 
Auditory-only, Visual-only, congruent Audio-Visual and 
incongruent audio-visual, referred to as Mismatch. The 
Mismatch condition was included to assess the effect of 
audio-visual congruency and to control for a more 
general effect of attention to the talking face. Experi- 
ment 2 used the same paradigm to trace the develop- 
ment of these modulatory effects over mid-to-late 
childhood, with a sample of children ranging from 6 
to 11 years. 



Experiment 1 

Method 
Participants 

Participants were 12 native English-speaking adults, who 
were naive to the experimental hypotheses (mean age = 
28.10 years, age range = 20.0-34.0 years). Participants 
were recruited through the Birkbeck College participant 
pool and were paid in exchange for taking part. 
Participants gave their written, informed consent. The 
experiment was approved by the Birkbeck College Ethics 
Committee. 

Stimuli 

When studying auditory ERP components in response to 
speech stimuli, previous studies have used repetitive 
consonant-vowel (CV) syllables such as [pa] (e.g. Besle 
et al., 2004) or single vowels (Klucharev et al., 2003). 
Here, the stimulus set was chosen to be as consistent with 
previous studies as possible while maximizing the like- 
lihood that young children would remain attentive and 
motivated. The stimuli therefore consisted of a set of 
monosyllabic, concrete, highly imageable nouns such as 
'bell' and 'pen'. The stimuli were recorded by a 
phonetically trained, female, native English speaker. In 
total 62 nouns were used, 19 of which were animal names 
such as 'cat' and 'pig'. The animal names acted as 
targets during the paradigm and were therefore not 
included in the ERP analysis. Of the 43 non-target 
nouns, 31 began with fricatives and three with affricates 
(of these 18 were bilabial, nine were alveolar and seven 
were velar), seven stimuli began with liquids and two 
with a vowel; in total, 29 stimuli began with a voiced 
phoneme. Sharp acoustic onsets were maintained across 
the stimulus set as the auditory N1 is sensitive to changes 
such as rise time (Spreng, 1980). Average age of 
acquisition of the non-target stimuli was 4.2 years (SD 
= 0.9 years) according to American norms (Kuperman, 
Stadthagen-Gonzalez & Brysbaert, 2012), and only two 
of the stimuli ('rose' and 'jam') had an age of acquisition 
marginally above the age of the youngest participant. 

Stimuli were recorded with a digital camera, at 25 
frames per second, and a separate audio recording was 
made simultaneously. Each token was recorded twice 
and the clearest exemplar was used to create the stimulus 
set. Auditory tokens were lined up with their corre- 
sponding visual tokens by matching the points of 
auditory onset in the tokens recorded by the external 
microphone and the video-camera's built-in microphone; 
auditory recordings were made at a sampling rate of 
44.1 kHz. Each token was edited to be 2000 ms long, 
including an 800 ms period at the start of each clip 
before auditory onset. There were therefore 800 ms 
during which visual articulatory cues were available 
before the onset of auditory information. This allowed 
for the natural temporal dynamics of audio-visual speech 
to remain intact while ensuring that each clip began with 
a neutral face. The length of this period was determined 
by the clip with the latest auditory onset relative to the 
onset of natural mouth movements, thus ensuring that 
no clips were manipulated in order to include this 800 ms 
visual-only period. The audible portion of each clip 
lasted on average 437 ms (SD = 51 ms). 
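
The clip timings described above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming only the values stated in the text (2000 ms clips, an 800 ms visual-only lead, 25 frames per second video and 44.1 kHz audio); the helper name `ms_to_units` is illustrative and not part of the original method.

```python
# Timing values taken from the text; the helper below is hypothetical.
CLIP_MS = 2000          # total clip length
VISUAL_LEAD_MS = 800    # visual-only period before auditory onset
VIDEO_FPS = 25          # video frame rate
AUDIO_HZ = 44_100       # audio sampling rate


def ms_to_units(duration_ms: int, rate_per_second: int) -> int:
    """Convert a duration in ms to a whole number of frames or samples."""
    units, remainder = divmod(duration_ms * rate_per_second, 1000)
    if remainder:
        raise ValueError("duration does not fall on a frame/sample boundary")
    return units


clip_frames = ms_to_units(CLIP_MS, VIDEO_FPS)          # video frames per clip
lead_frames = ms_to_units(VISUAL_LEAD_MS, VIDEO_FPS)   # frames before auditory onset
lead_samples = ms_to_units(VISUAL_LEAD_MS, AUDIO_HZ)   # audio samples before onset
```

Under these values both the clip length and the visual lead fall on whole video frames, so auditory onset can coincide with a frame boundary.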

These tokens were used as the stimulus set for the 
congruent Audio-visual (AV) condition. The stimuli for 
the three other conditions were then derived from them. 
A set of Auditory-only (AO) and a set of Visual-only (VO) 
stimuli were created by splitting the original tokens into 
their auditory and visual components. A final set of 
incongruent audio-visual, Mismatch (MM), stimuli were 
created by mismatching auditory and visual tokens but 
maintaining the relative timing. For example, the auditory 
token [lake] was dubbed on top of the visual token 
|rose| 800 ms after its onset. Tokens were paired according 
to onset phoneme, but such that none resulted in an 
illusory percept. Animal tokens were kept separate from 
non-animal tokens when Mismatch stimuli were made, as 
they were task-relevant. 

Procedure 

Testing was conducted in an electrically shielded room 
with dimmed lights. Participants were told that they 
would either see, hear, or both see and hear a woman 
saying words and that whenever she said an animal word 
they should press the mouse button. The button press 
task was included to help maintain the attention and 
motivation of the child participants. The role of attention 
is particularly important here, as the auditory N1 is 
both amplified and shows more temporal precision with 
increased selective attention (Martin, Barajas, Fernandez 
& Torres, 1988; Ritter, Simson & Vaughn, 1988; 
Thornton, 2008). Stimuli were presented via headphones 
at approximately 65 dB (SPL), as measured by a sound 
level meter 2 inches from the centre of the ear pad. 
Participants were seated in a chair 60 cm from the 
stimulus presentation screen, and used a chin rest to help 
keep their heads still and ensure that distance from the 
screen was kept constant. 

Participants completed five blocks of 60 trials. Over 
the course of five blocks, 75 stimuli of each condition 
were played, including five animal stimuli per block, 
resulting in a total of 300 trials per participant. In total, 
25 trials were target (animal) trials and were therefore 
not included in the analysis. The 43 non-target nouns 
were each repeated either once or twice in each of the 
four conditions over the course of the experiment. 
Conditions were randomly presented during each block, 
although the stimuli presented in each block were the 
same for each participant. During an audio-visual (AV or 
MM) or Visual-only (VO) trial a fixation screen 
appeared for a random period of time between 100 and 
400 ms, followed immediately by the video clip, as shown 
in Figure 1. The fixation variation was intended to 
minimize expectancy, which has been shown to both 
attenuate N1 amplitude (Lange, 2009; Viswanathan & 
Jansen, 2010) and result in slow wave motor anticipatory 
activity (Teder-Sälejärvi et al., 2002). During Auditory-only 
(AO) trials, the fixation screen remained during the 
stimulus presentation, after the same jittered period 
before auditory stimulus onset as for the other condi- 
tions. Participants were instructed to remain looking at 
the centre of the screen at all times, and deviations of 
gaze were monitored during each session using a video 
camera. Cartoon eyes on a white background were used 
as fixation and were located where the bottom of the 
speaker's nose appeared during video clips. The testing 
procedure lasted around 45 minutes. 

Recording 

High density Electrical Geodesics, Inc. (EGI) caps with 
128 electrodes joined and aligned according to the 
international 10-20 system (Jasper, 1958) were used. 
All bio-electrical signals were recorded using EGI 
NetAmps (Eugene, OR), with gain set to 10,000 times. 
The signals were recorded referenced to the vertex (Cz), 
and were re-referenced to the average during analysis. 




Figure 1 Example audio-visual trial timeline, marking visual onset and auditory onset. 

Data were recorded at 500 Hz and band-pass filtered 
online between 0.1 and 200 Hz. An oscilloscope and 
audio monitor were used to measure the accuracy of the 
relationship between stimulus presentation and electro- 
physiological recording, and to check the preservation of 
the relationship between auditory and visual stimuli. No 
more than 1 ms of disparity between audio 
and visual timing was recorded for any condition. 

Analysis and results 
Analysis 

The region of interest was defined as that which has 
previously been reported as most appropriate for recording 
mid-to-late latency auditory ERP components (see 
e.g. Giard, Perrin, Echallier, Thevenet, Froment & 
Pernier, 1994; Picton, Hillyard, Krausz & Galambos, 
1974). The region comprised five channels around, and 
including, the apex, Cz, which showed the clearest 
auditory components for these data. The two components 
analysed at this region of interest were the auditory 
N1 and auditory P2, with an average of activity taken 
over the five electrodes. The two ERP measures taken 
were peak-to-peak amplitude and peak latency for the 
N1 and P2 components. Windows of analysis were 
defined as follows: for the P1 (the amplitude of which 
was used to analyse the N1 component as the N1 and P2 
were measured as peak-to-peak values) a window from 
40 to 90 ms post stimulus onset was used; for the N1, 
80-140 ms; and for the P2, 160-230 ms. The analysis 
windows were based on a visual inspection of the grand 
average waveform and checked against data for each 
individual participant. 
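
The peak measures described above can be sketched in code. This is a minimal illustration, not the authors' analysis pipeline. It assumes a waveform already averaged over the five region-of-interest electrodes, sampled at 500 Hz with the first sample at stimulus onset, and it reads "peak-to-peak" as P1-to-N1 for the N1 and N1-to-P2 for the P2. The function names are hypothetical.

```python
SAMPLE_RATE_HZ = 500  # recording rate stated in the text (2 ms per sample)

# Analysis windows in ms post stimulus onset, as given in the text.
WINDOWS_MS = {"P1": (40, 90), "N1": (80, 140), "P2": (160, 230)}


def find_peak(wave_uv, window_ms, polarity):
    """Return (latency_ms, amplitude_uv) of the extreme value in a window.

    wave_uv[0] is assumed to be the sample at stimulus onset (0 ms).
    polarity is +1 for a positive peak (P1, P2) and -1 for a negative one (N1).
    """
    step_ms = 1000 / SAMPLE_RATE_HZ
    first = int(round(window_ms[0] / step_ms))
    last = int(round(window_ms[1] / step_ms))
    idx = max(range(first, last + 1), key=lambda i: polarity * wave_uv[i])
    return idx * step_ms, wave_uv[idx]


def n1_p2_measures(wave_uv):
    """Peak latencies plus peak-to-peak amplitudes (P1-to-N1, N1-to-P2)."""
    _, p1_amp = find_peak(wave_uv, WINDOWS_MS["P1"], +1)
    n1_lat, n1_amp = find_peak(wave_uv, WINDOWS_MS["N1"], -1)
    p2_lat, p2_amp = find_peak(wave_uv, WINDOWS_MS["P2"], +1)
    return {
        "N1_latency_ms": n1_lat,
        "P2_latency_ms": p2_lat,
        "N1_amplitude_uv": p1_amp - n1_amp,   # assumed P1-to-N1 reading
        "P2_amplitude_uv": p2_amp - n1_amp,   # assumed N1-to-P2 reading
    }
```

Measuring from the preceding peak rather than from baseline makes the amplitude less sensitive to slow drifts, which is one common motivation for peak-to-peak measures.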

Artefact detection was conducted using an automatic 
algorithm to mark channels as bad if activity exceeded 
100 µV at any point; these data were then checked by 
hand. Trials were rejected if 15 or more channels (12%) 
were marked as bad. Of those trials included in the 
analysis, an average of 1.1 channels (0.9%) were marked 
bad and the data for those channels were interpolated 
from the remaining channels. Participants were included 
in the analysis if they contributed at least 30 non-target 
trials per condition. All adult participants met this 
condition. The average percentage of trials included per 
condition was as follows: AO - 79% (SD = 16.8), VO - 
90% (SD = 9.8), AV - 85% (SD = 10.4), MM - 83% (SD 
= 13.4). 
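
The rejection and inclusion rules above can be expressed as a short sketch. It assumes each trial is a list of channels holding voltages in microvolts and covers only the automatic thresholds stated in the text; the manual check and the interpolation of bad channels are not modelled, and all names are hypothetical.

```python
AMPLITUDE_LIMIT_UV = 100   # a channel is bad if |activity| exceeds this
MAX_BAD_CHANNELS = 15      # trials with this many bad channels are rejected
MIN_TRIALS = 30            # minimum accepted non-target trials per condition


def bad_channels(trial):
    """Indices of channels whose absolute voltage exceeds the limit.

    `trial` is a list of channels, each a list of voltages in microvolts.
    """
    return [ch for ch, samples in enumerate(trial)
            if any(abs(v) > AMPLITUDE_LIMIT_UV for v in samples)]


def keep_trial(trial):
    """A trial is kept when fewer than MAX_BAD_CHANNELS channels are bad."""
    return len(bad_channels(trial)) < MAX_BAD_CHANNELS


def include_participant(accepted_counts):
    """True if every condition has at least MIN_TRIALS accepted trials."""
    return all(n >= MIN_TRIALS for n in accepted_counts.values())
```

For example, `include_participant({"AO": 41, "VO": 52, "AV": 47, "MM": 29})` returns `False`, because one condition falls below the 30-trial minimum.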

We directly compared activity in response to the 
audio-visual conditions with that in response to the AO 
condition, as only the modulation of auditory responses 
was of interest for the current purposes. Directly 
comparing unisensory and multisensory conditions 
avoids the issue of subtracting activity common to both 
auditory and visual unimodal responses, which can occur 
when using the more traditional model of comparing 
multisensory activity to the sum of the unisensory 
responses (Stekelenburg & Vroomen, 2007; Teder-Sälejärvi 
et al., 2002). 
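
The concern with the additive model can be illustrated with a toy calculation. The numbers below are hypothetical amplitudes in arbitrary units, chosen only to show the arithmetic; they are not data from this study. The sketch assumes a response with no audio-visual interaction at all, plus some activity (for example, anticipation) that is common to every condition.

```python
# Hypothetical toy amplitudes (arbitrary units) at one time point.
auditory_specific = 5.0   # activity evoked only by sound
visual_specific = 2.0     # activity evoked only by the face
common = 1.5              # activity present in every condition

# Toy case with NO audio-visual interaction at all.
ao = auditory_specific + common
vo = visual_specific + common
av = auditory_specific + visual_specific + common

# Additive model: the common activity is subtracted twice, so a purely
# additive response still yields a non-zero "interaction" of -common.
additive_difference = av - (ao + vo)

# Direct comparison: the common activity cancels, leaving only the
# visual-specific term in this toy case.
direct_difference = av - ao
```

In this toy case the additive model reports an apparent interaction equal to the negative of the common activity, whereas the direct AV versus AO comparison is unaffected by it.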

Results 

Behavioural results 

Accuracy of behavioural responses was converted to d', 
with a button press in response to an animal trial 
counting as a hit and any other button press as a false 
alarm. Only responses to AO and AV trials are 
reported here as the main aim of the behavioural task 
was to maintain attention. VO trials are not reported 
as the task was not designed to assess lip-reading 
ability, nor MM trials, due to difficulty in interpretation. 
The average d' for AO trials was 3.7 (SD = 1.4) 
and for AV trials was significantly greater (t(11) = 3.22, 
p = .008) at 5.7 (SD = 2.3). Correlations were run 
between these behavioural measures and each electro- 
physiological measure taken, but none reached signif- 
icance after Bonferroni correction for multiple 
comparisons. 
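
For readers unfamiliar with d', the following is a minimal sketch of how the index is computed from hits and false alarms as defined above. The log-linear correction (adding 0.5 to each count) is an assumption made here to avoid infinite values when a rate is 0 or 1; the text does not state how such cases were handled, and the function name is hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Assumption: a log-linear correction (add 0.5 to each count) keeps the
    rates away from 0 and 1; the source does not specify a correction.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)
```

A participant who responds to targets and non-targets at the same rate obtains a d' of zero, and values rise as hits increase and false alarms fall.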

Electrophysiological results 

The adult electrophysiological data followed the same 
pattern as that seen in previous studies, but with an 
additional effect of amplitude modulation for the P2 
component. A 3 x 2 repeated measures ANOVA was 
run with three levels of Condition (AO, AV, MM) and 
two levels of Component (N1 and P2), for amplitude and 
latency separately. For amplitude, a main effect of 
Condition was found, F(2, 22) = 28.43, p < .001, ηp² = 
0.72, with Bonferroni corrected pairwise comparisons 
revealing differences (p < .05) between each condition, 
AO > AV > MM. An interaction between Condition and 
Component also emerged, F(2, 22) = 9.90, p = .001, ηp² 
= 0.47, with P2, F(2, 22) = 26.47, p < .001, ηp² = 0.71, 
being more strongly modulated than N1, F(2, 22) = 
16.33, p < .001, ηp² = 0.60. Notably, after Bonferroni 
correction P2 showed significant (p < .01) modulation 
between all levels of Condition, whereas N1 only showed 
a difference between AO and each audio-visual condition, 
at p < .01 (see Figure 2). For latency, there was a 
main effect of condition, F(2, 22) = 4.89, p = .017, ηp² = 
0.31, driven by the difference (p < .05) between the AV 
condition and the other two conditions, such that AV < 
AO = MM, given Bonferroni correction for multiple 
comparisons. Latency modulation was therefore 
congruence-dependent. No interaction between condition 
and component emerged. 

Figure 2 Peak amplitude and latency of the auditory N1 and P2 components under the Auditory-only (AO), congruent Audio-visual 
(AV) and Mismatch (MM) conditions, for the adult participants. *p = 0.05; **p = 0.01. 
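
The partial eta squared values reported for Experiment 1 can be checked against the F ratios and degrees of freedom using the standard identity: partial eta squared equals F times the effect df, divided by F times the effect df plus the error df. The sketch below applies it to the five F(2, 22) values given in the text; the function and variable names are illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F(2, 22) values reported in the electrophysiological results.
reported_f = [28.43, 9.90, 26.47, 16.33, 4.89]
recovered = [round(partial_eta_squared(f, 2, 22), 2) for f in reported_f]
```

The recovered values can then be compared with the effect sizes printed alongside each F ratio.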

Discussion 

The aim of experiment 1 was to replicate in adults 
previous findings of the modulation of auditory ERP 
components by visual speech cues using a child-friendly 
paradigm and stimulus set. 

Adult use of visual cues 

Compared to auditory-only speech stimuli, audio-visual 
stimuli resulted in congruence-independent attenuation of 
N1 and P2 component amplitude and congruence-dependent 
shortening of component latency. The modulation 
of auditory ERP components therefore replicated previ- 
ous findings (Pilling, 2009; Van Wassenhove et al., 2005). 
This data set validated the use of the child-friendly 
paradigm on adults for subsequent use with a develop- 
mental sample. 

Van Wassenhove and colleagues (2005) proposed that 
the shortening of component latency in the presence of 
visual speech cues represents the use of visual cues to 
predict the content of the upcoming auditory signal; a 
proposal known as the 'predictive coding hypothesis'. 
This is possible in natural speech as the onset of visual 
cues occurs between 100 and 300 ms before their 
auditory counterparts (Chandrasekaran et al., 2009). 
Van Wassenhove and colleagues found particularly 
strong support for this notion as latency shortening 
was not only sensitive to congruency but further to the 
degree of ambiguity of the onset phoneme. Greater 
latency modulation was recorded given the syllable [pa] 
over [ta] and given [ta] over [ka]. In that study |pa| was the 
least ambiguous viseme (the visual correlate of an 
auditory phoneme), and as such was suggested to make 
a stronger prediction and result in faster processing of 
the more expected auditory signal. Ease of processing 
has previously been associated with the shortening of 
auditory N1 latency (Callaway & Halliday, 1982). 
Although stimuli in the current study could not be 
analysed by onset phoneme, the congruence-dependent 
shortening of latency further supports the predictive 
coding hypothesis. 

We additionally replicated findings of amplitude 
modulation regardless of congruency between the audi- 
tory and visual inputs, driven predominantly by the P2 
component (Pilling, 2009; Van Wassenhove et al., 2005). 
Two hypotheses have been put forward in the literature 
to explain congruence-independent effects of one sensory 
modality on another. Van Wassenhove and colleagues 
suggested that a reduction of amplitude results from 
visual speech cues driving more efficient auditory pro- 
cessing. The authors proposed that redundant informa- 
tion, carried in both senses, need not be fully processed 
by the auditory system, resulting in more efficient 
processing of information available through the auditory 
channel. In the case of visual speech cues, this may entail 
a reduction in processing of information from the second 
and third formants, which carry information about place 
of articulation. 

An alternative explanation, known as the 'deactivation 
hypothesis' (Bushara, Hanakawa, Immisch, Toma, 
Kansaku & Hallett, 2003; Wright, Pelphrey, Allison, 
McKeown & McCarthy, 2003), asserts that different 
parts of the multisensory processing stream are in 
competition, such that stimuli from different senses 
showing temporal and spatial synchrony produce super- 
additive activity in some areas, but suppression of 
activity in others. Under this view, when multisensory 
stimuli are available, regions that process more than one 
sense dominate over unisensory areas. So, for example, 
responses in auditory cortex are reduced in the presence 
of visual information about the same object or event, as 
multisensory processing regions compete and dominate. 
Experimental evidence from fMRI studies supports the 
theoretical notion of competition between unisensory 
and multisensory areas (Bushara et al., 2003). 

However, in the current data set the attenuation of 
P2 amplitude was greater for the audio-visual Mis- 
match condition than for the congruent Audio-visual 
condition. Given that an incongruent visual cue does 
not provide more reliable information regarding place 
of articulation, nor does it result in the perception of a 
multisensory event, these data are difficult to reconcile 
with either of the above hypotheses. A possible 
explanation lies in the nature of the stimuli used here. 
In the current study, the Mismatch stimuli consisted of 
entirely unrelated words presented in each sensory 
modality, for example, auditory [lake] paired with 
visual |rose|. This is in contrast to previous studies 
which have used McGurk stimuli (Pilling, 2009; Van 
Wassenhove et al., 2005), that is, incongruent CV 
syllables which can form coherent percepts despite 
their physical mismatch. 

The current data therefore support an alternative 
hypothesis that amplitude attenuation reflects compe- 
tition between sensory inputs, with competition being 
greater when auditory and visual systems are process- 
ing incompatible, and irreconcilable, stimuli. That this 
effect is restricted to the P2 component is compatible 
with evidence that it originates in posterior superior 
temporal cortex (Liebenthal, Desai, Ellingson, Ramachandran, 
Desai & Binder, 2010). The posterior superior 
temporal cortex is composed of the posterior 
superior temporal gyrus (pSTG) and sulcus (pSTS) and 
forms part of a network of regions implicated in audio- 
visual speech processing. This network also includes 
primary sensory cortices, frontal and pre-motor regions 
and the supramarginal gyrus (see Campbell, 2008, for a 
review). The pSTS is the most reliably activated region 
in fMRI studies in response to audio-visual over 
auditory speech, and lip-reading (Calvert, Bullmore, 
Brammer, Campbell, Woodruff, McGuire, Williams, 
Iversen & David, 1997; Calvert, Campbell & Brammer, 
2000; Callan, Jones, Munhall, Kroos, Callan & Vatikiotis-Bateson, 
2004; Capek, Bavelier, Corina, Newman, 
Jezzard & Neville, 2004; Hall, Fussell & Summerfield, 
2005; Skipper, Nusbaum & Small, 2005). Furthermore, 
pSTS is associated with learning inter-sensory pairings 
(Tanabe, Honda & Sadato, 2005), with auditory 
expertise (Leech, Holt, Devlin & Dick, 2009) and 
shows sensitivity to congruency in ongoing audio- 
visual speech (Calvert et al., 2000). In a systematic 
analysis of the role of pSTS in audio-visual processing, 
Hocking and Price (2008) suggest that this region is 
involved in conceptual matching regardless of input 
modality. 

Given that cortical regions involved in the generation 
of the auditory P2 component are sensitive to matching 
auditory and visual stimuli, the attenuation of P2 may 
reflect competition between neurons in a multisensory 
population responsive to different modalities, with com- 
petition increasing given irreconcilable incongruence. A 
possible next step in the examination of this hypothesis is 
to compare reconcilable (i.e. McGurk) and irreconcilable 
incongruent audio-visual speech stimuli within the same 
paradigm. 

Experiment 2 

Experiment 2 traced the developmental trajectory of 
auditory ERP modulation by visual speech cues from 
age 6 to 12, over which period children establish a 
reliable use of visual cues to aid speech perception as 
shown using behavioural measures (e.g. Wightman et al., 
2006). We sought to determine whether modulation of 
ERPs due to multisensory processing could be observed 
at an earlier age than has been measured behaviourally. 

Method 
Participants 

Thirty-eight typically developing children participated 
(mean age = 8.9 years, SD = 21 months, age range = 6.0- 
11.10 years, with between five and seven children in each 
year group). Children were recruited by placing adver- 
tisements in the local press, and were rewarded for their 
participation with small toys. Parents gave written, 
informed consent for their children. The experiment 
was approved by the Birkbeck College Ethics Commit- 
tee. One child was excluded from the analysis as a result 
of excessive noise in the data. 

Recording and procedure

The experimental procedure for children was almost 
identical to that used in Experiment 1 for adult partic- 
ipants. The procedure lasted slightly longer for children, 
around 60 minutes, as more time was spent practising 
sitting still. Blinking was not mentioned as it was judged 
that this would be hard for young children to control and 
would only serve to draw attention to the act. Paediatric 
EGI electroencephalographic nets with 128 electrodes 
were used for all child participants. 

Analysis 

The same region of interest and the same epoch windows 
were used for the child sample based on grand average 
data for each age group and checked against data for 
each individual participant. After artefact rejection, 
slightly more data were discarded as noisy than for the 
adult sample. For child participants, an average of 3.6 
channels (2.8%) were marked bad on accepted trials. As 
per the adults, participants were included in the analysis 
if they contributed at least 30 non-target trials per 
condition; one child was excluded from analysis on these 
grounds. The average percentage of trials included for 
the child sample was: AO - 57% (SD = 14.4), VO - 73%
(SD = 12.6), AV - 68% (SD = 14.7), MM - 67% (SD =
13.9).

Results 

Behavioural results 

The average d' for the child sample was 2.5 (SD = 1.9)
for AO and 2.7 (SD = 1.9) for AV trials. d' was
consistently good, with each age group scoring
significantly above zero on each measure at p < .05,
indicating satisfactory attention across all ages.
Behavioural performance improved over developmental
time, with Age predicting performance on both AO
(R² = 0.19, F(1, 35) = 8.26, p = .007) and AV trials (R²
= 0.15, F(1, 35) = 6.11, p = .018). Unlike the adult
sample in Experiment 1, on this simple detection task 
the child sample showed no behavioural benefit of AV 
trials over AO trials. Correlations between behavioural 
d' and brain responses were calculated for the child 
sample, but again no correlations survived Bonferroni 
correction for multiple comparisons. 
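
The text does not state how d' was calculated, so the following is only a minimal sketch of the standard signal-detection definition, d' = z(hit rate) - z(false-alarm rate). The function name `d_prime`, the count-based interface and the log-linear correction for extreme rates are all assumptions made for illustration, not details taken from the study.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    # Log-linear correction (an assumption, not stated in the text):
    # adding 0.5 to each count keeps the rates away from exactly 0 or 1,
    # where the inverse normal CDF would be infinite.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts for one participant in one condition.
print(round(d_prime(hits=18, misses=2, false_alarms=2, correct_rejections=18), 2))
```

A value of zero corresponds to chance-level detection, which is why the age groups were tested against zero.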

Electrophysiological results 

Figure 3 shows the grand average waveforms for the
6- and 7-year-olds, the 8- and 9-year-olds, the 10- and
11-year-olds as well as the adults from Experiment 1,
with the amplitude and latency values for the auditory
N1 and P2 components shown in Table 1. These
categorical age groupings are used here to illustrate
developmental change but in further analyses age is
treated as a continuous variable. To assess change over
time, the developmental data were entered into a
repeated measures ANCOVA with Condition (AO, AV,
MM) and Component (N1, P2) as the within subjects
factors, and Age (in months) added as a covariate. Main
effects of Condition were analysed separately in an
ANOVA (see Thomas, Annaz, Ansari, Scerif, Jarrold &
Karmiloff-Smith, 2009).

Figure 3 Grand average waveforms for each condition,
Auditory-only (AO), Audio-visual (AV), Mismatch (MM) and
Visual-only (VO), at the region of interest. Waveforms are
shown divided by age group. The onset points of the visual and
auditory stimuli are shown.

A main effect of Condition emerged, F(2, 72) = 10.16,
p < .001, ηp² = 0.22, with Bonferroni corrected pairwise
comparisons revealing differences (p < .01) between AO
and each multisensory condition, AO > AV = MM. An
interaction between Condition and Component emerged,
F(2, 72) = 9.59, p < .001, ηp² = 0.21, with the P2
component being affected by Condition, F(2, 72) = 17.12,
p < .001, ηp² = 0.32, but not the N1 (p = .420). Again
this P2 effect was driven by the difference (p < .001)
between AO and each multisensory condition (AO > AV =
MM), as shown by Bonferroni corrected pairwise
comparisons.

Table 1 Means (and standard deviations) for auditory N1 and P2 amplitude (peak to peak) and peak latency, for each age group.
Latency values are not given for the Visual-only condition, as amplitude values show latent activity within the window of analysis
rather than components

                    Auditory-only   Visual-only     Audio-visual    Mismatch
N1 amplitude (µV)
  6&7               2.5 (1.6)       2.3 (1.4)       3.3 (2.2)       2.3 (1.6)
  8&9               3.5 (3.7)       1.6 (0.9)       2.9 (2.1)       3.1 (3.6)
  10&11             3.7 (1.3)       1.6 (0.9)       2.3 (1.1)       3.0 (1.4)
  Adult             5.5 (1.4)       1.5 (0.4)       4.4 (1.2)       4.3 (1.4)
N1 latency (ms)
  8&9               105.5 (10.7)    -               105.8 (12.4)    110.4 (14.0)
  10&11             109.1 (12.5)    -               102.0 (12.7)    105.0 (11.4)
  Adult             103.3 (11.1)    -               95.6 (11.0)     101.2 (7.0)
P2 amplitude (µV)
  6&7               3.4 (2.3)       1.5 (1.1)       2.9 (2.3)       2.2 (2.1)
  8&9               6.6 (3.7)       1.8 (1.2)       4.7 (2.8)       4.2 (4.0)
  10&11             6.5 (2.7)       1.5 (1.4)       4.2 (2.9)       3.9 (2.2)
  Adult             10.8 (2.9)      1.5 (0.6)       9.1 (2.0)       8.1 (2.1)
P2 latency (ms)
  6&7               195.2 (15.5)    -               182.6 (17.1)    187.0 (20.2)
  8&9               188.9 (11.6)    -               182.0 (13.9)    183.5 (9.6)
  10&11             195.1 (17.6)    -               183.3 (19.8)    180.4 (11.7)
  Adult             196.2 (8.7)     -               190.0 (9.5)     196.1 (12.4)
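
Table 1 reports peak-to-peak amplitudes and peak latencies. As a rough illustration of how such measures can be read off an averaged waveform, the sketch below picks the extreme sample inside a search window for each component. The window boundaries, the choice of reference peak (P1 for the N1, and the N1 for the P2), the function names and the waveform itself are all hypothetical; the epoch windows actually used are those defined for Experiment 1.

```python
def peak_in_window(times_ms, amplitudes_uv, start_ms, end_ms, polarity):
    """Return (latency_ms, amplitude_uv) of the extreme sample in a window.

    polarity is +1 to find a positive peak and -1 to find a negative peak.
    """
    candidates = [(t, a) for t, a in zip(times_ms, amplitudes_uv)
                  if start_ms <= t <= end_ms]
    if not candidates:
        raise ValueError("window contains no samples")
    return max(candidates, key=lambda point: polarity * point[1])


def measure_n1_p2(times_ms, amplitudes_uv, windows):
    """Peak latencies and peak-to-peak amplitudes for N1 and P2."""
    p1 = peak_in_window(times_ms, amplitudes_uv, *windows["P1"], polarity=+1)
    n1 = peak_in_window(times_ms, amplitudes_uv, *windows["N1"], polarity=-1)
    p2 = peak_in_window(times_ms, amplitudes_uv, *windows["P2"], polarity=+1)
    return {
        "N1_latency_ms": n1[0],
        "P2_latency_ms": p2[0],
        # Assumed convention: N1 measured from the preceding P1 peak,
        # P2 measured from the preceding N1 trough.
        "N1_amplitude_uv": p1[1] - n1[1],
        "P2_amplitude_uv": p2[1] - n1[1],
    }


# Synthetic waveform sampled every 10 ms (illustrative values only).
times = list(range(0, 300, 10))
wave = [0.0, 0.2, 0.5, 0.8, 0.9, 1.0, 0.6, -0.2, -1.0, -1.7,
        -2.0, -1.6, -0.9, 0.0, 0.9, 1.6, 2.2, 2.7, 2.9, 3.0,
        2.8, 2.4, 1.9, 1.4, 1.0, 0.7, 0.4, 0.2, 0.1, 0.0]
# Hypothetical search windows in ms relative to auditory onset.
windows = {"P1": (30, 80), "N1": (70, 150), "P2": (150, 250)}
print(measure_n1_p2(times, wave, windows))
```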

There was no main effect of Age, but there was a
significant interaction between Age and both Component,
F(1, 35) = 9.52, p = .004, ηp² = 0.21, and
Condition, F(2, 70) = 4.05, p = .022, ηp² = 0.10. The
first of these interactions was driven by the P2 component
showing a main effect of Age, F(1, 35) = 5.31,
p = .027, ηp² = 0.13, whereas the N1 component did not
(p = .991). The Age by Condition interaction was driven
by the AO condition showing a main effect of Age, F(1,
35) = 4.14, p = .050, ηp² = 0.11, but not the AV (p = .97)
or the MM (p = .198) conditions. So, the main effect of
Condition revealed by the ANOVA seems to have been
driven predominantly by the older children, and as a
result of the AO response getting larger over development
(as illustrated in Figure 4).
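
The reported effect sizes can be checked against the F statistics, because partial eta squared follows directly from F and its degrees of freedom: ηp² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies this identity to three of the values reported above; the function name is chosen here for illustration.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic.

    Follows from eta_p^2 = SS_effect / (SS_effect + SS_error) and
    F = (SS_effect / df_effect) / (SS_error / df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F statistics and degrees of freedom as reported in the text.
reported = [
    ("Condition (amplitude)", 10.16, 2, 72),
    ("Age x Component", 9.52, 1, 35),
    ("Age within AO", 4.14, 1, 35),
]
for label, f_value, df_effect, df_error in reported:
    print(label, round(partial_eta_squared(f_value, df_effect, df_error), 2))
```

These reproduce the reported values of 0.22, 0.21 and 0.11. With a single predictor the same expression gives R² for a simple regression.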

To further assess the changing relationship between
Conditions over Age, a linear regression was run with
Age as a predictor of the difference between AO and
each audio-visual condition for N1 and P2. Age was
found to significantly predict the difference between the
AO and AV conditions for N1 amplitude, R² = 0.13,
F(1, 35) = 5.38, p = .026, β = 0.365, and P2 amplitude,
R² = 0.13, F(1, 35) = 5.077, p = .031, β = 0.356. The
age at which the difference between conditions became
significant was determined using the 95% confidence
intervals around the regression lines (see Figure 5). The
lower boundary crossed zero at 122 months
(10.1 years) for N1 amplitude, and at 89 months
(7.4 years) for P2 amplitude. The increasing difference
between conditions was approximately equivalent for
each component. However, Figure 4 suggests that for
the N1 component, the change in difference results
predominantly from a decrease in Audio-visual
response amplitude, while for P2 the change was
predominantly driven by an increase in Auditory-only
amplitude. Age did not predict the difference between
the AO and MM conditions for either the N1
(p = .846) or P2 (p = .087) components.
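
The confidence-interval procedure described above can be sketched as follows: fit a line to the AO minus AV difference as a function of age, compute the lower 95% bound on the fitted mean at each age, and report the first age at which that bound exceeds zero. This is an illustration under stated assumptions: the data are synthetic, the interval is taken to be the confidence interval for the mean predicted value, the function names are invented here, and the critical value 2.030 is the two-tailed 5% point of t with 35 degrees of freedom (37 children).

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y on x, with the terms needed for CIs."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    resid_se = math.sqrt(resid_ss / (n - 2))
    return {"n": n, "x_bar": x_bar, "sxx": sxx, "slope": slope,
            "intercept": intercept, "resid_se": resid_se}


def lower_bound(model, x0, t_crit):
    """Lower confidence limit for the mean predicted value at x0."""
    fitted = model["intercept"] + model["slope"] * x0
    se = model["resid_se"] * math.sqrt(
        1 / model["n"] + (x0 - model["x_bar"]) ** 2 / model["sxx"])
    return fitted - t_crit * se


def first_significant_age(model, ages, t_crit):
    """First age (in the order given) where the lower bound is above zero."""
    for age in ages:
        if lower_bound(model, age, t_crit) > 0:
            return age
    return None


# Synthetic data: 37 ages in months and an AO - AV difference that grows
# with age plus alternating noise. These are NOT the study's data.
ages = list(range(72, 145, 2))
diffs = [0.03 * (a - 90) + (1.0 if i % 2 == 0 else -1.0)
         for i, a in enumerate(ages)]

model = fit_line(ages, diffs)
T_CRIT = 2.030  # two-tailed 5% critical value of t, df = 35
print(first_significant_age(model, range(72, 145), T_CRIT))
```

Because the interval is narrowest near the mean age and widens towards the extremes, the crossing point depends on the spread of the sample as well as on the slope.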

For latency, the ANOVA revealed a main effect of
Condition, F(2, 72) = 5.14, p = .008, ηp² = 0.13, driven
by the difference (p < .05) between the AO and each
audio-visual condition, AO > AV = MM. An interaction
also emerged between Condition and Component, F(2,
72) = 5.52, p = .006, ηp² = 0.13. The P2 component was
significantly influenced by Condition, F(2, 72) = 7.30,
p = .001, ηp² = 0.17, driven by the Bonferroni corrected
difference (p < .05) between AO and both audio-visual
conditions; the N1 component was not influenced by
Condition (p = .128).

The ANCOVA for latency revealed a main effect of
Age, F(1, 35) = 4.56, p = .040, ηp² = 0.12, but no
interaction between Age and Condition (see Figure 4).
So, the latency of these auditory components was seen to 
shorten over development, but the effect of Condition 
did not change over this age range. 

All analyses were re-run comparing responses to the 
multisensory conditions with responses to the sum of the 
unisensory conditions. This is a more traditional 
approach adopted in multisensory processing studies 
(see Calvert, 2001). The results of this analysis showed 
the same pattern but with larger sub-additive effects, that 
is, the effect of Condition was exaggerated for all 
comparisons and was therefore less conservative. 
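
The difference between the two approaches can be made concrete with a small sketch. The direct comparison contrasts the audio-visual response with the auditory-only response, whereas the additive model contrasts it with the sum of the two unisensory responses. The example applies both to the adult P2 group means from Table 1 purely for illustration; the reported re-analysis would have been run on participant-level data, and the function name is an assumption.

```python
def condition_effects(av, ao, vo):
    """Return (direct, additive) differences for one measure.

    direct   = AV - AO         (comparison used in the main analyses)
    additive = AV - (AO + VO)  (multisensory vs. sum of unisensory responses)
    """
    direct = av - ao
    additive = av - (ao + vo)
    return direct, additive


# Adult P2 amplitude means (µV) from Table 1, used only as an illustration.
direct, additive = condition_effects(av=9.1, ao=10.8, vo=1.5)
print(round(direct, 1), round(additive, 1))
```

Whenever the visual-only response is positive, the additive difference is more negative than the direct one, which is why the additive comparison yields larger sub-additive effects.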

Figure 4 Developmental trajectories for the Auditory-only (AO), Audio-visual (AV) and Mismatch (MM) conditions for auditory N1
and P2 peak to peak amplitude and peak latency.



Discussion 

The influence of visual cues over mid-to-late childhood 

With regard to amplitude, as a group, the children
responded similarly to the adults, in that the P2
component was attenuated given congruent and
incongruent visual cues compared to the Auditory-only
condition. Over developmental time the P2 component
increased in amplitude, with this effect being driven by
an increase in response to the Auditory-only condition.
Age predicted the difference between the Auditory-only
and Audio-visual (congruent) conditions for both
components, with this effect on P2 predominantly resulting
from an increased response to the Auditory-only stimuli,
while for the N1 component a slight decrease in
amplitude in response to the Audio-visual stimuli seems
to be responsible. The difference between conditions
became significant from 10.1 years for the N1 component,
and at 7.4 years for the P2 component. The period
between these two ages matches that seen in behavioural
studies when visual speech cues come to reliably
influence auditory perception both in terms of the McGurk
illusion and audio-visual advantage during speech-in-noise
(e.g. Tremblay et al., 2007; Wightman et al., 2006).
These results suggest that the modulation of different
auditory components represents separate processes in the
integration and/or use of visual speech cues, and that this
developmental process may be traced at the behavioural
level. What is not clear is exactly what the information
processing correlates of N1 and P2 attenuation might be.

If amplitude modulation does represent competition
between inputs from different sensory modalities, as
suggested above, then the developmental data imply that
this response only emerges over mid-to-late childhood, but
is not fully mature by age 12 as the additional amplitude
attenuation seen in adults to incongruent audio-visual
stimuli was not seen for the oldest children in this sample.
This protracted period of maturation maps onto imaging
data showing that regions in superior temporal cortex, which
contribute to P2 generation in children as they do in adults
(Ponton, Eggermont, Khosla, Kwong & Don, 2002), do
not mature until the teenage years (Gogtay et al., 2004; see
Lenroot & Giedd, 2006). Recent functional imaging data
mirror this late development and support the role of STS
in children's audio-visual speech perception (Nath, Fava
& Beauchamp, 2011). Dick and colleagues (Dick, Solodkin
& Small, 2010) measured brain activity in response
to auditory and audio-visual speech in adults and 8- to
11-year-old children, and found that while the same areas
were involved in perception for both adults and children,
the relationships between those areas differed. For example,
the functional connectivity between pSTS and frontal
pre-motor regions was stronger for adults given audio-visual
over auditory-only speech, but weaker for children.

Figure 5 Regression model with age predicting the difference between the AO and audio-visual conditions for auditory N1 and P2
amplitude. The arrows show the points at which the lower 95% confidence interval crosses 0 (122 and 89 months, respectively).

With regard to latency, a different pattern emerged for
the children, as a group, compared to the adult sample in
Experiment 1. For the children, only the P2 component
exhibited latency modulation in response to visual speech
cues, and latency shortening was observed regardless of
congruency between auditory and visual cues.
Interpretations of previous adult data (Pilling, 2009; Van
Wassenhove et al., 2005) have rested on the effect of
congruence-dependency, with congruent visual cues
suggested to allow a prediction of the upcoming auditory
signal, such that the degree of latency shortening reflects
the difference between expected and perceived events.
The current developmental data are not sensitive to
congruency, and therefore cannot be interpreted entirely
with recourse to the prediction of signal content. The
present and previous adult data may therefore not tell the
whole story regarding latency modulation. One possibility
is that visual cues are involved in predicting not just
what is about to be presented, but also when it is to be
presented. Certainly, using non-speech stimuli, the auditory
N1 and P2 components have been shown to be
sensitive to both the content and timing of stimulus
presentation (Viswanathan & Jansen, 2010). In this case,
children of the age range tested here may use visual cues
to predict the timing but not the content of the upcoming
auditory signal.

The idea that visual speech cues may allow a predic- 
tion of when important information in the auditory 
stream will be presented has been proposed before under 
the 'peak listening' hypothesis (Kim & Davis, 2004). This
theory states that visual speech cues predict when in the 
auditory signal energetic peaks will occur, which are 
particularly beneficial when processing speech in noise. If 
the shortening of latency does represent two predictive 
measures, then future work should reveal that latency 
shortening is sensitive to manipulations of both predict- 
ability of content and timing of the auditory signal 
relative to visual cues. Age did not interact with 
Condition with respect to latency modulation, so no 
change in the ability to predict the upcoming auditory 
stimulus emerged over this developmental window. The 
influence of visual speech cues on the latency of auditory 
components from age 6 may therefore represent an 
aspect of audio-visual speech perception that is contin- 
uous from infancy despite the U-shaped behavioural 
trajectory outlined in the introduction. However, the
change in congruency dependence must occur after the 
age of 12, possibly revealing a much later sensitivity to 
upcoming auditory content. 

Over developmental time, a main effect of age on 
component latency was revealed, indicating that children 
process these stimuli more rapidly as they get older. 
Auditory ERP responses are known to show a gradual 
course of developmental change and maturation over 
childhood and adolescence (Bishop, Hardiman, Uwer &
von Suchodoletz, 2007; Lippe, Kovacevic & McIntosh,
2009). It is hard to tease apart the extent to which these
changes result from the slow physiological maturation of
the auditory cortex (Moore, 2002), or changes in
cognitive processes functionally underlying the activity
or, more likely, a complex interaction between the two.

General discussion 

The aim of this study was to chart the trajectory of the 
modulation of auditory ERP components by visual 
speech cues over developmental time. We first validated 
a new child-friendly paradigm using adult participants, 
which replicated previous findings of congruence-depen- 
dent shortening of ERP component latency and congru- 
ence-independent attenuation of component amplitude. A 
greater attenuation of amplitude emerged given mis- 
matched visual speech cues, suggesting that attenuation 
may represent competition between inputs from different 
sensory modalities. This competition may be important 
for the process of evaluating the nature of multisensory 
stimuli in order to determine whether information across 
modalities refers to the same object or event. We have
shown that the modulation of auditory ERP components 
by visual speech cues gradually emerges over develop- 
mental time, maturing at around the age when behavioural 
studies have revealed a use of visual cues in speech 
perception tasks. Notably though, the additional sensi- 
tivity to incongruent visual cues seen in adults was not 
evident in this developmental sample. 

Regarding latency shortening, our adult results repli- 
cated previous findings, supporting the notion that latency 
modulation represents the process of predicting the 
content of the upcoming auditory signal, the predictive 
coding hypothesis. However, data from our child sample 
showed latency shortening for the P2 component regard- 
less of the congruence between auditory and visual signals. 
We have therefore suggested that latency shortening may 
represent two predictive processes, relating to both the 
content and timing of the upcoming auditory signal, but 
that children within the age range tested here are not yet 
able to make content predictions. 

Overall, these data support and extend previous 
studies pointing to the influence of visual cues on 
processing auditory speech. We have supported the 
notion that amplitude and latency modulation represent 
different aspects of audio-visual signal processing, but 
reinterpreted those data in the light of our new 
paradigm, and the developmental results. Furthermore, 
we have presented new data revealing that these 
responses gradually emerge over childhood. 

Study limitations and outstanding questions 

This study was successful in its aim to develop a
child-friendly ERP paradigm for the study of audio-visual
speech, but was limited in a number of respects. The age
range tested here, although relatively wide, was not 
sufficient to fully trace the development of the electro- 
physiological markers of audio-visual speech perception 
into adulthood. Another limitation, in terms of being 
able to draw firm conclusions, was that the audio-visual 
Mismatch stimuli used here were all irreconcilably 
incongruent. While this led to an interesting finding 
when compared to previous studies with adults, it might 
also have changed the strategy of participants. As 
matched and mismatched multisensory stimuli were 
randomly intermixed within each block, participants 
may have adopted more of a 'wait and see' strategy than 
they would under more naturalistic settings. One way for 
future studies to address whether this factor had a 
significant impact on the results would be to separate 
conditions by block. 

Finally it should be noted that all the stimuli here were 
presented under conditions of no notable auditory noise. 
This factor may turn out to substantially impact on 
electrophysiological data given that dynamic functional 
changes in connectivity have been recorded between 
unisensory cortices and the STS as a function of noise 
(Nath & Beauchamp, 2011). This modulation is thought 
to reflect changes in the weighting of information from 
each sensory modality, and should be considered in 
future electrophysiological investigations. 

One question that has emerged from the current work 
is exactly what the development of electrophysiological 
responses represents at the level of information process- 
ing. The data on amplitude modulation presented here fit 
well with the behavioural data examining the gross 
benefit of visual cues to children. However, the modula- 
tion of component latency was evident at younger ages, 
and certainly the use of visual cues in infancy suggests 
that the process is one of continuous change rather than 
simply 'coming online' later in childhood. This develop- 
mental profile may represent changes in how visual 
speech cues are utilized in childhood with increasing 
experience and cortical maturation. For example, Fort
and colleagues (Fort, Spinelli, Savariaux & Kandel, 2012) 
found that during a vowel monitoring task both adults 
and 5- to 10-year-old children, as a group, benefited from 
the availability of visual speech cues, but only adults 
showed an additional benefit of lexicality. These authors 
suggest that where adults use visual cues to help retrieve 
lexical information, children use the same cues to process 
the phonetic aspects of speech. Over developmental time, 
then, children may first use visual speech cues to aid 
phonetic processing, and later to aid comprehension. 

This critical issue of the changing relationship between 
brain and behaviour over development needs to be 
addressed with further electrophysiological exploration 
in conjunction with more sensitive behavioural methods
aimed at elucidating the different potential uses of visual 
speech cues. The exploration of audio-visual speech over 
childhood is important not just for typically developing 
children learning about the world in auditory noise, but 
also critically for those children growing up with 
developmental language disorders, for whom multisen- 
sory cues may contain valuable information to assist 
language development. 